This report is a product of the School Evaluation Project. As part of
the Center's Program on Evaluation of Educational Systems, the School
Evaluation Project is designed to develop and field test sets of
procedures which may be used by school evaluators and administrators
engaged in evaluating schools: preschool, elementary, and secondary.
The project is attempting to capitalize on the state of current
knowledge to develop evaluation procedures which are appropriate
especially to the first two stages of the Center's evaluation
framework: Needs Assessment and Program Planning. The Center is
concerned with developing procedures which will enable school
principals and others to use information effectively in making valid
decisions for improving student performance. The School Evaluation
Project is currently field testing an evaluation KIT which is composed
of a series of booklets describing how to conduct a needs assessment
of an elementary school's student output.

For years various professional organizations in education and
psychology have recognized the need to set specific criteria for
assessment devices. However, attempts to develop such criteria have
been, at best, timid (viz.: Technical Recommendations). This timidity
where "angels dare not tread" may not be completely reprehensible; it
is the result of several factors:

(a) the criteria may not be equally appropriate for all types
of measures,

(b) the direct result of such a set of criteria would be the
ability to evaluate critically all available assessment
devices,

(c) the producers of the instruments might not be too pleased
and, worse, might take well-reasoned issue with the criteria
and their authors, and

(d) the authors, being motivated primarily by altruism and
social justice, might have to take their own inadequate,
but lucrative, products off the market.

PROCEDURE 

The Center for the Study of Evaluation, in order to make an equable
appraisal of the output measures published for use in evaluating elementary
schools, programs, and students, developed a comprehensive
objectives-based classification of needs-assessment areas for elementary
education, and a critical test evaluation procedure to apply to
measurement devices in any of the need areas. Preparatory to the
evaluations, all those measures presently available for elementary school
evaluation were located. Each test or sub-scale was assigned to the
pre-established goal area into which it best fit. The tests were then
evaluated in order to identify and endorse those output measures most
appropriate, effective, and useful in assessing schools or students. The
evaluation form used throughout the test evaluations is shown below.

Figure 1

MEAN TEST EVALUATION FORM

Evaluation Criteria          Rating (circle one number in each row)

1. Measurement Validity                          M Total ____  Grade ____
   a. Content and Construct
        0 (only in name) | 2 (a few) | 4 (some) | 6 (fair job) |
        8 (best available) | 10 (hit nail on the head)
   b. Concurrent and Predictive
        0 (none reported) | 1 (very little) | 2 (some) | 3 (not enough) |
        4 (considerable) | 5 (exhaustive)

2. Examinee Appropriateness                      E Total ____  Grade ____
   a. Comprehension
        content:       0 (inappropriate) | 1 (doubtful) |
                       2 (possibly appropriate) | 3 (probably appropriate) |
                       4 (exactly right)
        instructions:  0 | 1 | 2 | 3 | 4
   b. Format
      1. Visual principles
        0 (complicated) | 1 (probably good) | 2 (outstanding aids)
      2. Quality of illustrations (print)
        0 (not good) | 1 (helpful) | 2 (excellent)
      3. Time and pacing
        0 (bad) | 1 (appropriate for broad range)
   c. Recording answers
        0 (complicated) | 1 (standard) | 2 (especially easy)

3. Administrative Usability                      A Total ____  Grade ____
   a. Administration
      1. Test administration
        0 (individual) | 1 (small groups) | 2 (large groups)
      2. Training of administrators
        0 (psychometrist) | 1 (school staff)
      3. Administration
        0 (43+ minutes) | 1 (42 minutes or less)
   b. Scoring
        0 (subjective) | 1 (difficult) | 2 (simple)
   c. Interpretation
      1. Norms
         a. Norm range
        0 (restricted) | 1 (broad)
         b. Score interpretation
        0 (uncommon, abstruse) | 1 (common, simple)
         c. Score conversion
        0 (complicated) | 1 (simple) | 2 (clear, tables)
         d. Norm groups
        0 (local, outdated, or poorly sampled) | 1 (national, well sampled)
   d. Score Interpreter
        0 (psychometrist) | 1 (school staff)
   e. Can Decisions Be Made
        0 (doubtful) | 1 (possible) | 2 (probable) |
        3 (yes, charts and graphs)

4. Normed Technical Excellence                   N Total ____  Grade ____
   a. Stability
        0 (not reported or less than .70) | 1 (.70 to .80) |
        2 (.80 to .90) | 3 (.90+)
   b. Internal Consistency
        0 | 1 | 2 | 3
   c. Alternate form
        0 | 1 | 2 | 3
   d. Replicability
        0 | 1
   e. Range of Coverage
        0 (no information) | 1 (floor or ceiling reached) |
        2 (adequate) | 3 (more than adequate)
   f. Scores
        0 (poorly graduated and uncommon) |
        1 (poorly graduated or uncommon) |
        2 (well graduated and standard)

The MEAN (an acronym for the four criterion areas to follow)
evaluation procedure critically reflects four vital areas of concern to
test users: Measurement Validity, Examinee Appropriateness,
Administrative Usability, and Normed Technical Excellence. Twenty-four
separate evaluations, comprising the four major criterion areas, were
performed on 1,649 scales. These scales comprise all the output
measures that are prepared for or are potentially useful for evaluations
within the elementary school and that are generally available to educators
and researchers.
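
As a rough illustration of how the twenty-four ratings roll up into the
four criterion totals, the sketch below encodes the maximum points for each
row of the form in Figure 1 and sums a set of ratings by criterion area. The
short item names and the function name are hypothetical labels introduced
only for this example, and the sketch assumes the criterion total is a plain
sum of the circled numbers; how totals were converted to the "Grade" entry
is not described here, so that step is left out.

```python
# Maximum points for each row of the MEAN form (Figure 1).
# The item keys are hypothetical names chosen for this sketch.
MAX_POINTS = {
    "M": {"content_construct": 10, "concurrent_predictive": 5},
    "E": {"content": 4, "instructions": 4, "visual_principles": 2,
          "illustrations": 2, "time_pacing": 1, "recording_answers": 2},
    "A": {"test_administration": 2, "administrator_training": 1,
          "administration_time": 1, "scoring": 2, "norm_range": 1,
          "score_interpretation": 1, "score_conversion": 2,
          "norm_groups": 1, "score_interpreter": 1, "decisions": 3},
    "N": {"stability": 3, "internal_consistency": 3, "alternate_form": 3,
          "replicability": 1, "range_of_coverage": 3, "scores": 2},
}


def mean_totals(ratings):
    """Sum one test's ratings into the four criterion totals (M, E, A, N).

    `ratings` mirrors the shape of MAX_POINTS: area -> item -> circled number.
    Assumes each total is a simple sum of the ratings in that area.
    """
    totals = {}
    for area, items in MAX_POINTS.items():
        total = 0
        for item, max_pts in items.items():
            value = ratings[area][item]
            if not 0 <= value <= max_pts:
                raise ValueError(f"{area}.{item}: {value} outside 0..{max_pts}")
            total += value
        totals[area] = total
    return totals


# A test that earned the top rating on every row.
perfect = {area: dict(items) for area, items in MAX_POINTS.items()}
print(mean_totals(perfect))
```

On the form as reconstructed, the rows in each criterion area add up to the
same maximum, so the four totals are directly comparable without rescaling.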

The four criteria comprising the MEAN system are explained below.
They were meant to exhaust the breadth of interest areas of educators
and also of educational researchers. However, the final ratings obtained
for each test indicate its appropriateness for school evaluation settings
rather than for clinical or research problems.

Measurement Validity 

Evaluations on the criterion of measurement validity were made in
answer to the question: "Does the test appear to measure the specific
educational objective?" (entry 1 of Table 1, page 9). This is
essentially a question of content and face validity, the validities
being keyed to the pre-established goal areas for elementary education.
Trained evaluators were instructed to judge each test according to its
capacity to assess the particular goal which it purported to measure or
which a plurality of its items appeared to reflect. The judgments were
made on the basis of careful reading of the items to determine whether
they appeared to assess the goal and whether they proportionately assessed
the whole range of content within the goal. Such judgments were fairly
well structured and reliable in the content achievement areas, but were
more difficult to make in the non-content areas of affective and
cognitive behaviors. A second aspect of measurement validity concerned the
extent of reported empirical validation, either predictive or concurrent
(entry 2, Table 1).

Examinee Appropriateness

The second criterion of the MEAN evaluations was designed to assess
how appropriate the test is for the students who will be assessed by it.
Concern was directed toward the appropriateness of the test's level of
comprehension, its physical format, and its required response mode.

Evaluation of the appropriateness of test content centered upon the
difficulty of the semantic or numerical items and also upon the relevance
or interest-arousing aspects of the items (entry 3, Table 1). Similar
criteria were applied to the test instructions since they determine
whether or not the examinee will be able to manifest his mastery of the
item content (entry 4, Table 1). Instructions which appear simple to
adults were often found to be confusing to young children. The second
major area where appropriateness is felt to be important is that of test
format. The visual or auditory principles employed in test presentation
were evaluated in terms of effective usage of Gestalt principles (entry 5,
Table 1). The evaluators looked for specific format features such as
sufficiency of white space between items, visual or auditory coherence of
item stems and alternatives, and effective use of color as an aid in
segregating items. The general quality of illustrations and print was
also considered under physical format (entry 6, Table 1).

For each scale, pacing or time limits were judged for their
appropriateness for the subject matter and for the examinees (entry 7,
Table 1). Published statements regarding the speededness of tests were
corroborated, when possible, by consulting item difficulty indexes and
score distributions. In almost all cases, power was preferred to speed as
an attribute of tests of educational output. The last aspect of
appropriateness considered was the mode of response recording (entry 8,
Table 1). The more simple and direct connections between the item stem and
the recording of a response were given more credit. All aspects of examinee
appropriateness were rated relative to the specific grade level to which
the test is directed.

Administrative Usability

After asking "What will it measure?" and "Is it designed for my students?",
the next question was concerned with how usable the test is in terms of
administration, scoring, interpretation, and decision making. These aspects
of a test comprise the third criterion of the MEAN evaluations.

It was assumed that for general assessment of educational output, a test
that can be administered to a large group is more desirable. Small group and
individually administered tests were judged to be less usable for evaluation
of instructional programs (entry 9, Table 1); their usefulness for in-depth
individual diagnosis was not in question. A second variable strongly
affecting a test's utility is the training necessary to administer the test
appropriately (entry 10, Table 1). Since few schools have resident
psychometrists and since most district psychometrists focus their attentions
on individual student problems, a test was deemed to have greater utility if
it could be administered by the school staff, preferably by the students'
teacher. Tests were also credited if they fit into a typical class period
and did not necessitate special scheduling (entry 11, Table 1).

The utility of a test is further affected by the scoring procedure it
requires (entry 12, Table 1). Simple and objective hand or machine scoring
of tests was considered optimal for utility; subjective scoring resulted in
no credit. From a pragmatic viewpoint, while ease of administration and
scoring are desirable, they are dwarfed by the importance of being able to
interpret the scores and then of reaching some decision (entry 18, Table 1).
Tests from which prescriptive decisions can be made were given greater
credit. Common, simple scores for interpretation earned a test more credit.
In addition, a broad normative sample (entry 13, Table 1) which allows for
both high and low achievement was rated superior to a restrictive sample;
a current and representative norming sample was also rated higher (entry 16,
Table 1).

The normative score conversions were evaluated according to three
criteria. If the derived scale is common and generally understood, the
test was given more credit (entry 14, Table 1). If the conversion is clear
and unambiguous, the test earned credit over those with complicated,
multi-stage conversions (entry 15, Table 1). These two aspects of the
derived scores determine in part who can interpret them. Tests yielding
scores interpretable by school staff were preferred to those demanding the
skills of a psychometrist (entry 17, Table 1). The final pragmatic
consideration of a test's utility rested on whether or not decisions, either
individual or group, can be made on the basis of information in the test
manuals.

Normed Technical Excellence

The last major criterion of the MEAN evaluation procedure was concerned
with the reliability, replicability, and refinement of measurement of the
tests. Reliability was evaluated separately for published reports of
test-retest (entry 19, Table 1), internal-consistency (entry 20, Table 1),
and alternate-form estimates (entry 21, Table 1). Closely related to the
concept of test reliability is that of replicability of procedures to obtain
the scores (entry 22, Table 1). If procedures described in the test manual
are complicated, subjective, and based upon abnormal samples, the test is
clearly not replicable. Replicable procedures for obtaining scores were
judged as more valuable.
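
The reliability ratings can be pictured as a simple banding of the
coefficient reported in a test manual. The sketch below follows the bands
printed on the form for Stability, taking the lowest cut-off as .70 so that
it agrees with the next band (.70 to .80). Two things are assumptions of
this sketch rather than statements from the form: that a coefficient falling
exactly on a boundary is placed in the higher band, and that the same bands
were applied to the internal-consistency and alternate-form rows, which show
only the numbers 0 through 3. The function name is hypothetical.

```python
def reliability_rating(coefficient=None):
    """Map a reported reliability coefficient to the form's 0-3 rating.

    None stands for "not reported". Boundary values go to the higher
    band, which is an assumption of this sketch.
    """
    if coefficient is None or coefficient < 0.70:
        return 0  # not reported, or below the lowest band
    if coefficient < 0.80:
        return 1  # .70 to .80
    if coefficient < 0.90:
        return 2  # .80 to .90
    return 3      # .90 and above


# Illustrative (made-up) coefficients, not values from any real manual.
for r in (None, 0.65, 0.74, 0.88, 0.93):
    print(r, reliability_rating(r))
```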

The range of coverage is also an important aspect of a test's technical
excellence. A broad developmental range which is appropriate for one level
of assessment but which can also be applied to students above and below that
level was preferred to a restrictive range (entry 23, Table 1). Related to
the range problem is the refinement or gradation of the inter-individual
comparison scores; the finer the gradation, the better the evaluation of the
test (entry 24, Table 1).

ANALYSIS 

Each of the tests and scales, then, earned four scores, one for each of
the MEAN criteria. These scores and their bases are published in greater
detail in Hoepfner, Strickland, Stangel, Jansen, and Patalino (1970). The
four MEAN scores were, however, based upon twenty-four individual judgments.
These discrete judgments were factor analyzed in order to uncover the
characteristics of tests which actually do cohere. Table 1 presents the
twenty-four criteria, the range of points possible for each of their
evaluations, and the means of the consensual judgments for grades 1,
3, 5, and 6.

The separate judgments for each of the scales within each of the
four grade levels were submitted to a principal-axes factor analysis.
Initial solutions showed that only four factors appeared with
regularity in all four grade levels. Because a fifth factor appeared
in only two of the solutions (not chronologically adjacent grade levels),
communality iterations were based on four factors. The matrices of
intercorrelations among the rated characteristics are presented in
Tables 2 through 5. The varimax factor loadings for the four factors
and for the four grade levels are presented in Table 6.
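
For readers who want to see the shape of such an analysis, the following is
a minimal sketch of iterated principal-axis factoring followed by a varimax
rotation, written with NumPy. It is not the program used for this report.
The starting communalities (squared multiple correlations), the iteration
limits, the tolerances, and the function names are all assumptions of the
sketch, and the small correlation matrix in the usage lines is made up for
illustration.

```python
import numpy as np


def principal_axis(R, n_factors, n_iter=50, tol=1e-6):
    """Iterated principal-axis factoring of a correlation matrix R."""
    R = np.asarray(R, dtype=float)
    # Starting communalities: squared multiple correlations (an assumption).
    h2 = 1.0 - 1.0 / np.diag(np.linalg.inv(R))
    for _ in range(n_iter):
        reduced = R.copy()
        np.fill_diagonal(reduced, h2)          # communalities in the diagonal
        eigvals, eigvecs = np.linalg.eigh(reduced)
        top = np.argsort(eigvals)[::-1][:n_factors]
        lam = np.clip(eigvals[top], 0.0, None)  # guard against negative roots
        loadings = eigvecs[:, top] * np.sqrt(lam)
        new_h2 = np.sum(loadings ** 2, axis=1)
        converged = np.max(np.abs(new_h2 - h2)) < tol
        h2 = new_h2
        if converged:
            break
    return loadings, h2


def varimax(loadings, n_iter=100, tol=1e-8):
    """Orthogonal varimax rotation using the usual SVD-based iteration."""
    L = np.asarray(loadings, dtype=float)
    p, k = L.shape
    rotation = np.eye(k)
    last = 0.0
    for _ in range(n_iter):
        rotated = L @ rotation
        target = rotated ** 3 - rotated * (np.sum(rotated ** 2, axis=0) / p)
        u, s, vt = np.linalg.svd(L.T @ target)
        rotation = u @ vt
        crit = np.sum(s)
        if last != 0.0 and crit / last < 1.0 + tol:
            break
        last = crit
    return L @ rotation


# Made-up two-factor correlation matrix, for illustration only.
true = np.array([[0.8, 0.0], [0.7, 0.0], [0.6, 0.0],
                 [0.0, 0.75], [0.0, 0.65], [0.0, 0.55]])
R = true @ true.T
np.fill_diagonal(R, 1.0)
loadings, h2 = principal_axis(R, n_factors=2)
print(np.round(varimax(loadings), 2))
```

Because the rotation is orthogonal, each variable's communality (the sum of
its squared loadings) is the same before and after rotation; only the
distribution of that variance across factors changes.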

RESULTS 

Mean ratings of evaluative test qualities, as presented in Table 1,
indicate no significant trends of increased or decreased quality over the
four grade levels. One of the most salient findings in Table 1 is the
relatively higher reliability estimate obtained through internal-consistency
techniques. Whether or not this is an artifact of the ease of its
estimation or the vulnerability of such estimates to extraneous inflationary
factors cannot be determined.

It can also be seen from Table 1 that publishers provide very little
evidence for the concurrent and predictive validities of their tests in the
manuals they provide. This reflects, of course, the great costs to the
publisher of such studies and the necessary delay from the time the manual
is published to the time that various independent research findings can
become incorporated into the publisher's documentation (if, indeed, they
ever are). Nonetheless, the typical rating on this criterion can be
described as "very little evidence."

[Table: Intercorrelations among 24 Ratings Made on 477 Tests at the
Fifth-Grade Level. Note: Decimal points omitted; communality estimates
in diagonal.]

[Table: Intercorrelations among 24 Ratings Made on 508 Tests at the
Sixth-Grade Level. Note: Decimal points omitted; communality estimates
in diagonal.]

The comprehension levels of test items and instructions appear
rather satisfactory, all means falling above the "probably appropriate"
rating. This reflects the fact that most instruments at the elementary
level are developed by curriculum experts at each grade level. Time and
pacing and response recording procedures are also rated highly, probably
for the same reason.

The visual principles and quality of illustrations for tests are
rated at only slightly above average. Such mediocrity may be due to the
expense of good graphics and layout or may be the result of a deliberate
attempt by some publishers to avoid producing too polished a product (that
might appear more commercial than educational).

The tests' major shortcomings in the area of Administrative Usability
are the low quality of norm-group sampling and the failure to provide
prescriptive decision rules on the basis of test results. Maintaining norm
currency and obtaining national representativeness of the norm groups are
the most expensive aspects of test publishing, and so it is not surprising
that norms lack these qualities. Definitive and prescriptive decision rules
violate the often repeated (and frequently justified) warnings against too
literal and decisive interpretations from faulty test scores. It seems that
in following these well-intentioned warnings, the publishers make their
instruments less useful for most educators, who cannot operate with the
ambiguous decision-making data provided for them.

While it is difficult to draw conclusions from the massive amounts
of data provided in the correlation matrices (Tables 2 through 5), the
outstanding finding is the relative lack of correlation between the
ratings on the two kinds of test validity. The correlations between the
ratings of face-content and concurrent-predictive validities range from
-13 to +12 (decimal points omitted), clearly demonstrating their
independence, not only as constructs, but as results of actual practice in
test construction and development.

The varimax solutions in Table 6 evidence considerable factorial
invariance over the four grade levels. The fact that some instruments
were common to more than one solution, being appropriate for a large
grade span, cannot be hypothesized as accounting for this invariance, as
there were few such overlapping instruments and the test evaluations were
made separately at each grade level.
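
The report does not say how factorial invariance was judged. One common way
to put a number on the similarity of a factor across two solutions is
Tucker's congruence coefficient, which is offered here only as an
illustration and not as the method used in this study. The function name and
the loading values in the usage lines are made up for the example.

```python
import math


def congruence(x, y):
    """Tucker's congruence coefficient between two columns of loadings.

    x and y hold the loadings of the same variables on one factor in two
    different solutions (for example, two grade levels).
    """
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return num / den


# Made-up loadings for one factor at two grade levels.
grade_a = [0.71, 0.64, 0.58, 0.10, -0.05]
grade_b = [0.68, 0.66, 0.49, 0.15, 0.02]
print(round(congruence(grade_a, grade_b), 3))
```

Values near 1 indicate that the two columns have nearly proportional
loadings; values near 0 indicate unrelated patterns.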

Factor A, consistently led by the variables of Test Administration,
Training of Administrators, Score Interpreter, Scoring, and Replicability,
clearly reflects a "Usability" dimension upon which tests can be placed.
While not the same as the MEAN criterion of administrative usability, it is
related, as four of the eight variables having significant loadings are
components of this criterion. It is interesting to note the consistent
negative loadings for the Examinee Appropriateness ratings, especially for
Visual Principles and Quality of Illustrations; perhaps this indicates that
increased efforts to make tests usable have resulted in decreased attempts
at making tests appropriate for the examinees.

Factor B is consistently led by the variables of Range of Coverage,
Gradation of Scores, Norm Range, Score Interpretation, Score Conversion,
and Internal-Consistency Reliability. This constellation of test
attributes is named the "Norm Quality" factor, implying that normed tests
tend to be good or bad in most of the norming attributes.

Factor C is led in all four grade levels by the variables of Ability
to Make Decisions, Content and Construct Validity, and Content Comprehension.
The factor probably reflects the amount of specificity of coverage of a
test; tests being directed specifically to some focal goal area scored
higher on these criteria. For this reason, Factor C is called the "Focus"
factor.

Factor D is led by the variables of Concurrent and Predictive Validity,
Norm Representativeness, and Test-Retest Reliability. In several of the
grade levels, the factor is further supported by the variables of
Internal-Consistency and Alternate-Form reliabilities. This factor is a
parallel to Factor B and is called the "Psychometric Quality" factor.
Apparently, publishers either exhaustively analyze their tests on all
psychometric criteria, tend not to analyze on any of the criteria, or seek
some consistent level of psychometric analysis.



CONCLUSIONS 

Mean ratings of evaluations of tests, as presented in Table 1, indicate 
major shortcomings that characterize today's published instruments for 
elementary education. A factor analysis of these ratings revealed four 
consistent dimensions upon which tests actually vary: Usability, Norm Quality, 
Focus, and Psychometric Quality. The results of this test of tests should
have many immediate and long-term implications for the improvement of
assessment instrumentation by pointing out rather clearly the shortcomings
that characterize today's published tests.